BERT: Pre-training of Deep Bidirectional Transformers

The year 2018 marked a turning point for the field of Natural Language Processing (NLP).
The BERT paper [Devlin et al., 2018] introduced a new language representation model that achieved state-of-the-art results on a wide range of NLP tasks.
BERT is a deep bidirectional transformer model that is pre-trained on a large corpus of unlabeled text.
The model is trained to predict masked words in a sentence (masked language modeling) and to classify whether one sentence follows another in the original text (next sentence prediction).
The pre-trained model can then be fine-tuned on a variety of downstream NLP tasks with state-of-the-art results.
BERT builds on two key ideas:
The transformer architecture [Vaswani et al., 2017]
Unsupervised pre-training
BERT is pre-trained on a large corpus of unlabeled text. Its weights are learned by predicting masked words in a sentence and by classifying whether one sentence follows another (next sentence prediction).
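The masked-language-model corruption step can be sketched in a few lines. This is a minimal illustration, not BERT's actual implementation: 15% of tokens are selected for prediction, and of those, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged. The `mask_tokens` function and the toy vocabulary are hypothetical names for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=1):
    """Sketch of BERT's 80/10/10 masking scheme (illustrative only)."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"        # 80%: replace with [MASK]
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)  # 10%: replace with a random token
            # else: 10%: keep the original token unchanged
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, vocab=["dog", "ran", "hat"])
```

During pre-training, the loss is computed only at the positions stored in `targets`, which is what makes the objective bidirectional: the model may use context on both sides of a masked position.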
BERT is a (multi-headed) beast#
BERT is a deep bidirectional transformer model. BERT-Base (BERT-Large) is a multi-headed beast with 12 (24) layers, 12 (16) attention heads per layer, and 110 (340) million parameters. Since attention weights are not shared across layers or heads, the total number of distinct attention heads is 12 × 12 = 144 (24 × 16 = 384).
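The head-count arithmetic above can be spelled out directly:

```python
# Attention-head bookkeeping for the two published BERT sizes.
configs = {
    "BERT-Base":  {"layers": 12, "heads_per_layer": 12},
    "BERT-Large": {"layers": 24, "heads_per_layer": 16},
}
for name, c in configs.items():
    total = c["layers"] * c["heads_per_layer"]
    print(f"{name}: {total} distinct attention heads")
# BERT-Base: 144 distinct attention heads
# BERT-Large: 384 distinct attention heads
```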
Visualizing BERT#
Because of BERT’s complexity, it is difficult to understand the meaning of its learned weights intuitively. To help with this, we can visualize the attention weights of BERT’s self-attention layers.
%pip install bertviz
%config InlineBackend.figure_format='retina'
from bertviz import model_view, head_view
from transformers import AutoTokenizer, AutoModel, utils
utils.logging.set_verbosity_error() # Suppress standard warnings
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
inputs = tokenizer.encode("The cat sat on the mat", return_tensors='pt')
outputs = model(inputs)
attention = outputs[-1] # Output includes attention weights when output_attentions=True
tokens = tokenizer.convert_ids_to_tokens(inputs[0])
head_view(attention, tokens)
The tool visualizes attention as lines connecting the position being updated (left) with the position being attended to (right).
Colors identify the corresponding attention head(s), while line thickness reflects the attention score.
At the top of the visualization, you can select the model layer and the attention head(s) to visualize.
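The attention scores that bertviz renders as line thickness come from scaled dot-product attention: each query position gets a weight distribution over all key positions that sums to 1. A minimal sketch with toy 2-d vectors (the real model uses learned 64-dimensional projections per head):

```python
import math

def attention_weights(queries, keys):
    """Scaled dot-product attention weights (toy sketch, no value vectors)."""
    d = len(keys[0])
    weights = []
    for q in queries:
        # Dot product of query with every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # Numerically stable softmax turns scores into a distribution
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

W = attention_weights([[1.0, 0.0], [0.0, 1.0]],
                      [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each row of `W` sums to 1; a thick line in the visualization corresponds to one large entry in such a row.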
What does BERT actually learn?#
Let’s explore the attention patterns of various layers of BERT (the BERT-Base, uncased version).
Sentence A: I went to the store.
Sentence B: At the store, I bought fresh strawberries.
BERT uses WordPiece tokenization and inserts special classifier ([CLS]) and separator ([SEP]) tokens, so the actual input sequence is:
[CLS] I went to the store . [SEP] At the store , I bought fresh straw ##berries . [SEP]
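The subword split of "strawberries" can be sketched with WordPiece's greedy longest-match-first rule. The toy vocabulary below is hand-built for illustration (the real bert-base-uncased vocabulary has roughly 30,000 entries), and `wordpiece` is a hypothetical helper, not the library's implementation:

```python
# Toy vocabulary; non-initial subwords carry the "##" prefix.
VOCAB = {"straw", "##berries", "the", "store", "i", "went", "to",
         "at", "bought", "fresh", ".", ","}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first subword tokenization (sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:  # try the longest remaining substring first
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces get the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matches at all
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("strawberries"))  # ['straw', '##berries']
```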
inputs = tokenizer.encode(
    "I went to the store.",  # sentence A
    "At the store, I bought fresh strawberries.",  # sentence B (text_pair)
    return_tensors="pt",
)
outputs = model(inputs)
attention = outputs[-1]
tokens = tokenizer.convert_ids_to_tokens(inputs[0])
Pattern 1: Attention to next word#
See an example for layer 2, head 0. (The selected head is indicated by the highlighted square in the color bar at the top.) Most of the attention at a particular position is directed to the next token in the sequence.
If no token is selected, the visualization shows the attention pattern for every token in the sequence; selecting a token restricts the view to the attention from that token.
If you select the token "i", virtually all of its attention is directed to the next token, "went". The [SEP] token disrupts the next-token pattern: most of the attention from [SEP] is directed to [CLS] (the first token in the sequence) rather than to the next token.
This pattern, attention to the next token, appears to work primarily within a sentence.
This pattern is related to the idea of a recurrent neural network (RNN) that is trained to predict the next word in a sequence.
head_view(attention, tokens, layer=2, heads=[0])
Pattern 2: Attention to previous word#
See an example for layer 6, head 11. In this pattern, much of the attention is directed to the previous token in the sequence.
For example, most of the attention from "went" is directed to the previous token, "i". The pattern is not as distinct as the next-token pattern, but it is still present.
Some attention is also dispersed to other tokens in the sequence, especially to the [SEP] token.
This pattern is also related to the idea of an RNN, in this case the forward direction of an RNN.
head_view(attention, tokens, layer=6, heads=[11])